In this notebook I train several models and investigate their variable importance. The models are trained to predict house prices from the dataset's features. The first two models are:
1) XGBoost
2) Random forest
Then I train the same two models again, but on $log(price)$ instead of the raw price. The distribution of prices is far from normal: most samples cluster around the median price, while a few are several times more expensive. I suspected that taking the logarithm of the price might help, and I wanted to check this:
3) XGBoost on $log(price)$
4) Random forest on $log(price)$
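Before the results, a quick sanity check of the log-transform intuition. This is a sketch on a synthetic lognormal sample (not the actual house prices): a heavy right tail gives large positive skewness, which the logarithm removes almost entirely.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
# synthetic right-skewed "prices": lognormal, long right tail
prices = rng.lognormal(mean=13, sigma=0.5, size=50_000)

print(skew(prices))          # strongly positive: long right tail
print(skew(np.log(prices)))  # close to zero: roughly symmetric
```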
| Model | MAE |
|---|---|
| XGBRegressor | 68,680$ |
| RandomForestRegressor | 70,045$ |
| XGBRegressor_LogTransform | 68,370$ |
| RandomForestRegressor_LogTransform | 71,335$ |
Taking the logarithm of prices slightly improves the XGBoost model, but the difference is so small that it may be due to random sampling. All four models have similar performance (on the test set drawn from the data set as described in the Appendix).
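For reference, the MAE reported above is simply the mean absolute difference between true and predicted prices; a toy sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([200_000.0, 350_000.0, 500_000.0])
y_pred = np.array([210_000.0, 340_000.0, 530_000.0])

# mean of |y - y_hat|: (10,000 + 10,000 + 30,000) / 3
manual = np.mean(np.abs(y_true - y_pred))
print(manual, mean_absolute_error(y_true, y_pred))
```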
We can see that four features (lat, long, sqft_living and grade) are the most important for all models; they differ only in order. lat is the most important for every model, and waterfront is the fifth most important for three of them.
These results are consistent with my previous instance-level findings with SHAP and LIME, where I compared three houses with very different prices (very low, average and extremely high).
Unlike in the variable-importance results, there long is more important than lat. But in general, grade, sqft_living and geographical location are the most important features both from the point of view of the whole model and from the point of view of the (deliberately selected) samples.
The fact that latitude is more important at the global level while longitude seems more important at the instance level needs some explanation. My three sample instances happen to lie at similar latitudes, so only a change in longitude can push them towards the town centre. Had I chosen a house at the bottom of the map, latitude would probably have been the most important.
Let's return to variable importance. Recall that lat, long, sqft_living and grade are the most important features. The impact of lat and long is shown above. The influence of grade and sqft_living is also clearly visible in the data:
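One simple way to see such an influence without any model is to group prices by grade and correlate them with sqft_living. Below is a sketch on synthetic data built to mimic that pattern (the column names match the real dataset, the numbers do not):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# synthetic stand-in: price grows with grade and with sqft_living
grade = rng.integers(4, 13, size=2_000)
sqft = rng.normal(2_000, 600, size=2_000).clip(400)
price = 50_000 * grade + 100 * sqft + rng.normal(0, 30_000, size=2_000)
df = pd.DataFrame({"grade": grade, "sqft_living": sqft, "price": price})

# median price climbs with every grade step
print(df.groupby("grade")["price"].median())
# positive correlation between living area and price
print(df["price"].corr(df["sqft_living"]))
```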
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as rmse  # note: this is MSE, not RMSE, despite the alias
from sklearn.metrics import mean_absolute_error as mae
import xgboost
import dalex as dx
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
house_data = pd.read_csv("../../../Data/kc_house_data.csv")
house_data.head()
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.00 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.00 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.00 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.00 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
5 rows × 21 columns
X = house_data.drop(["id", "price", "date", "zipcode"], axis = 1)
cols_idx = {}
for col in X.columns:
    cols_idx[col] = X.columns.get_loc(col)
y = house_data["price"]
y_binned = y.apply(lambda x: round(x/1e+5) if x<2e+6 else 21)
bins = y_binned.unique().tolist()
bins.sort()
X_train, X_test, y_train, y_test = train_test_split(X, y_binned, test_size = 0.3, stratify = y_binned, random_state = 1)
# y_train/y_test now hold the split y_binned; recover the original prices at the same indexes
y_train = y.iloc[y_train.index]
y_test = y.iloc[y_test.index]
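The binning above exists only so that `train_test_split` can stratify on price: a plain random split could put most of the expensive houses on one side. A minimal check that stratification keeps bin proportions nearly identical (synthetic prices and coarser bins than above, for compactness):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
y = pd.Series(rng.lognormal(13, 0.5, size=5_000))

# coarse price bins with one overflow bin for the expensive tail
y_binned = y.apply(lambda x: min(int(x // 2e5), 5))

idx = np.arange(len(y))
train_idx, test_idx = train_test_split(
    idx, test_size=0.3, stratify=y_binned, random_state=1
)

train_props = y_binned.iloc[train_idx].value_counts(normalize=True)
test_props = y_binned.iloc[test_idx].value_counts(normalize=True)
# bin shares in train and test differ by only a tiny fraction
print((train_props - test_props).abs().max())
```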
I train an XGBoost model with parameters that performed well in the previous homework.
%%time
xgb_model = xgboost.XGBRegressor(colsample_bytree=0.6, eta=0.15, max_depth=5, gamma=0,
                                 objective="reg:squarederror", random_state=1)
xgb_model.fit(X_train, y_train)
CPU times: user 3.3 s, sys: 31.1 ms, total: 3.33 s Wall time: 326 ms
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.6, eta=0.15, gamma=0,
gpu_id=-1, importance_type='gain', interaction_constraints='',
learning_rate=0.150000006, max_delta_step=0, max_depth=5,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=12, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
%%time
rf_model = RandomForestRegressor(n_jobs = -1, random_state=1)
rf_model.fit(X_train, y_train)
CPU times: user 10.6 s, sys: 83.4 ms, total: 10.7 s Wall time: 1.02 s
RandomForestRegressor(n_jobs=-1, random_state=1)
xgb_model2 = xgboost.XGBRegressor(objective="reg:squarederror", random_state=1)
xgb_model2.fit(X_train, np.log(y_train))
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=12, num_parallel_tree=1, random_state=1,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
class XGBRegressor_LogTransform:
    # wraps an XGBoost model trained on log(price); predicts on the original price scale
    def __init__(self, model):
        self.model = model

    def predict(self, X):
        return np.exp(self.model.predict(X))
xgb_log_model = XGBRegressor_LogTransform(xgb_model2)
rf_model2 = RandomForestRegressor(n_jobs = -1, random_state=1)
rf_model2.fit(X_train, np.log(y_train))
RandomForestRegressor(n_jobs=-1, random_state=1)
class RandomForestRegressor_LogTransform:
    # wraps a random forest trained on log(price); predicts on the original price scale
    def __init__(self, model):
        self.model = model

    def predict(self, X):
        return np.exp(self.model.predict(X))
rf_log_model = RandomForestRegressor_LogTransform(rf_model2)
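One caveat about these wrappers: a squared-error model fit on log(price) targets the mean of the log, and exponentiating that yields the geometric mean, which by Jensen's inequality undershoots the arithmetic mean of a right-skewed target. A numeric sketch of the effect on a synthetic lognormal sample, with the standard lognormal correction factor:

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=13, sigma=0.5, size=100_000)

# exp of mean log = geometric mean: biased low for skewed data
naive = np.exp(np.log(prices).mean())
# lognormal correction: multiply by exp(var(log) / 2)
corrected = naive * np.exp(np.log(prices).var() / 2)

print(naive, corrected, prices.mean())
```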
models = [xgb_model, rf_model, xgb_log_model, rf_log_model]
errors = dict()
for model in models:
    y_hat = model.predict(X_test)
    error = mae(y_test, y_hat)
    errors[model.__class__.__name__] = error
performance = pd.DataFrame(index=errors.keys(), data={'MAE': errors.values()})
performance.style.format('{:,.0f}$')
| Model | MAE |
|---|---|
| XGBRegressor | 68,680$ |
| RandomForestRegressor | 70,045$ |
| XGBRegressor_LogTransform | 68,370$ |
| RandomForestRegressor_LogTransform | 71,335$ |
exp_xgb = dx.Explainer(xgb_model, X, y)
Preparation of a new explainer is initiated
  -> data              : 21613 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 21613 values
  -> model_class       : xgboost.sklearn.XGBRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x7fc912a0a9d0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 1.04e+05, mean = 5.39e+05, max = 7.57e+06
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.91e+06, mean = 6.93e+02, max = 2.19e+06
  -> model_info        : package xgboost

A new explainer has been created!
exp_xgb.model_parts().plot()
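By default, dalex's `model_parts()` computes permutation-based variable importance: shuffle one feature, re-score the model, and read the loss increase as that feature's importance. A from-scratch sketch of the idea on synthetic data (not dalex's exact implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
# feature 0 drives the target, feature 1 is pure noise
X = rng.normal(size=(2_000, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=2_000)

model = RandomForestRegressor(random_state=1).fit(X, y)
base_loss = mean_absolute_error(y, model.predict(X))

importances = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature j's link to y
    importances.append(mean_absolute_error(y, model.predict(X_perm)) - base_loss)

print(importances)  # the informative feature dominates
```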
exp_xgb_only_train = dx.Explainer(xgb_model, X_train, y_train)
exp_xgb_only_train.model_parts().plot()
Preparation of a new explainer is initiated
  -> data              : 15129 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 15129 values
  -> model_class       : xgboost.sklearn.XGBRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x7fc912a0a9d0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 1.04e+05, mean = 5.4e+05, max = 7.57e+06
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -4.67e+05, mean = -1.87, max = 8.53e+05
  -> model_info        : package xgboost

A new explainer has been created!
exp_xgb_only_test = dx.Explainer(xgb_model, X_test, y_test)
exp_xgb_only_test.model_parts().plot()
Preparation of a new explainer is initiated
  -> data              : 6484 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 6484 values
  -> model_class       : xgboost.sklearn.XGBRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x7fc912a0a9d0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 1.1e+05, mean = 5.37e+05, max = 4.88e+06
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.91e+06, mean = 2.31e+03, max = 2.19e+06
  -> model_info        : package xgboost

A new explainer has been created!
exp_rf = dx.Explainer(rf_model, X, y)
exp_rf.model_parts().plot()
Preparation of a new explainer is initiated
  -> data              : 21613 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 21613 values
  -> model_class       : sklearn.ensemble._forest.RandomForestRegressor (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x7fc912a0a9d0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 1.01e+05, mean = 5.4e+05, max = 6.75e+06
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.54e+06, mean = 5.2e+02, max = 2.04e+06
  -> model_info        : package sklearn

A new explainer has been created!
exp_xgb_log = dx.Explainer(xgb_log_model, X, y)
exp_xgb_log.model_parts().plot()
Preparation of a new explainer is initiated
  -> data              : 21613 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 21613 values
  -> model_class       : __main__.XGBRegressor_LogTransform (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x7fc912a0a9d0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 7.51e+04, mean = 5.34e+05, max = 7.46e+06
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -2.25e+06, mean = 6.08e+03, max = 3.94e+06
  -> model_info        : package __main__

A new explainer has been created!
exp_rf_log = dx.Explainer(rf_log_model, X, y)
exp_rf_log.model_parts().plot()
Preparation of a new explainer is initiated
  -> data              : 21613 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 21613 values
  -> model_class       : __main__.RandomForestRegressor_LogTransform (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_default at 0x7fc912a0a9d0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 9.25e+04, mean = 5.31e+05, max = 6.57e+06
  -> model type        : regression will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -1.37e+06, mean = 9.32e+03, max = 3.06e+06
  -> model_info        : package __main__

A new explainer has been created!